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Abstract 

Data  on  “neural  coding”  have  frequently  been  analyzed  using  information-theoretic  mea¬ 
sures.  These  formulations  involve  the  fundamental,  and  generally  difficult  statistical  problem 
of  estimating  entropy.  We  review  briefly  several  methods  that  have  been  advanced  to  estimate 
entropy,  and  highlight  a  method,  the  coverage  adjusted  entropy  estimator  (CAE),  due  to  Chao 
and  Shen  that  appeared  recently  in  the  environmental  statistics  literature.  This  method  be¬ 
gins  with  the  elementary  Horvitz-Thompson  estimator,  developed  for  sampling  from  a  finite 
population  and  adjusts  for  the  potential  new  species  that  have  not  yet  been  observed  in  the 
sample — these  become  the  new  patterns  or  “words”  in  a  spike  train  that  have  not  yet  been 
observed.  The  adjustment  is  due  to  I.J.  Good,  and  is  called  the  Good- Turing  coverage  estimate. 
We  provide  a  new  empirical  regularization  derivation  of  the  coverage-adjusted  probability  esti¬ 
mator,  which  shrinks  the  MLE.  We  prove  that  the  CAE  is  consistent  and  first-order  optimal, 
with  rate  Op  (1/ log  n),  in  the  class  of  distributions  with  finite  entropy  variance  and  that  within 
the  class  of  distributions  with  finite  gth  moment  of  the  log-likelihood,  the  Good-Turing  cov¬ 
erage  estimate  and  the  total  probability  of  unobserved  words  converge  at  rate  Op(l/(logn)*). 
We  then  provide  a  simulation  study  of  the  estimator  with  standard  distributions  and  examples 
from  neuronal  data,  where  observations  are  dependent.  The  results  show  that,  with  a  minor 
modification,  the  CAE  performs  much  better  than  the  MLE  and  is  better  than  the  Best  Upper 
Bound  estimator,  due  to  Paninski,  when  the  number  of  possible  words  m  is  unknown  or  infinite. 

*To  appear  in  a  special  issue  of  Statistics  in  Medicine  on  neuronal  data  analysis. 


Public  reporting  burden  for  the  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and 
maintaining  the  data  needed,  and  completing  and  reviewing  the  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information, 
including  suggestions  for  reducing  this  burden,  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington 
VA  22202-4302.  Respondents  should  be  aware  that  notwithstanding  any  other  provision  of  law,  no  person  shall  be  subject  to  a  penalty  for  failing  to  comply  with  a  collection  of  information  if  it 
does  not  display  a  currently  valid  0MB  control  number. 
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1  Introduction 


The  problem  of  “neural  coding”  is  to  elucidate  the  representation  and  transformation  of  information 
in  the  nervous  system.  [17]  An  appealing  way  to  attack  neural  coding  is  to  take  the  otherwise 
vague  notion  of  “information”  to  be  defined  in  Shannon’s  sense,  in  terms  of  entropy.  [20]  This 
project  began  in  the  early  days  of  cybernetics  [24,  11],  received  considerable  impetus  from  work 
summarized  in  the  book  Spikes:  Exploring  the  Neural  Code  [18] ,  and  continues  to  be  advanced  by 
many  investigators.  In  most  of  this  research,  the  findings  concern  the  mutual  information  between 
a  stimulus  and  a  neuronal  spike  train  response.  For  a  succinct  overview  see  [4].  The  mutual 
information,  however,  is  the  difference  of  marginal  and  expected  conditional  entropies;  to  compute 
it  from  data  one  is  faced  with  the  basic  statistical  problem  of  estimating  the  entropy^ 

H  :=  -'^P{x)logP{x)  (1) 

of  an  unknown  discrete  probability  distribution  P  over  a  possibly  infinite  space  X,  the  data  being 
conceived  as  random  variables  Xi, . . . ,  A„  with  Xi  distributed  according  to  P.  An  apparent  method 
of  estimating  the  entropy  is  to  apply  the  formula  after  estimating  P(x)  for  all  x  G  X,  but  estimating 
a  discrete  probability  distribution  is,  in  general,  a  difficult  nonparametric  problem.  Here,  we 
point  out  the  potential  use  of  a  method,  the  eoverage  adjusted  estimator  (CAE),  due  to  Chao  and 
Shen  [5],  which  views  estimation  of  entropy  as  analogous  to  estimation  of  the  total  of  some  variable 
distributed  across  a  population,  which  in  turn  may  be  estimated  by  a  simple  device  introduced 
by  Horvitz  and  Thompson  [8].  We  provide  an  alternative  derivation  of  this  estimator,  establish 
optimality  of  its  rate  of  convergence,  and  provide  simulation  results  indicating  it  can  perform  very 
well  in  finite  samples-even  when  the  observations  are  mildly  dependent.  The  simulation  results 
for  data  generated  to  resemble  neuronal  spike  trains  are  given  in  Figure  1,  where  the  estimator  is 
labeled  CAE.  In  Section  2  we  provide  background  material.  Section  3  contains  our  derivation  of 
the  estimator  and  the  convergence  result,  and  Section  4  the  description  of  the  simulation  study  and 
additional  simulation  results. 

^Unless  otherwise  stated,  we  take  all  logarithms  to  be  base  2  and  define  OlogO  =  0. 
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VI  VLMC  (T=6) 


Field  L  VLMC  (T=15) 


Figure  1:  Comparison  of  entropy  estimators  in  terms  of  root  mean  squared  error,  as  a  function  of 
sample  size,  for  word  lengths  T  =  6  from  VI  data  (left)  and  T  =  15  from  Field  L  data  (right). 
Full  definitions  are  given  in  Section  4.  The  samples  of  size  n  are  drawn  from  a  stationary  variable 
length  Markov  chain  (VLMC)  [10]  used  to  model  neuronal  data  from  visual  (VI)  and  auditory 
(Field  L)  systems.  We  followed  the  “direct  method”  and  divided  each  sample  sequence  into  words, 
which  are  blocks  of  length  T.  The  plots  display  the  root  mean  squared  error  (RMSE)  of  the 
estimates  of  H/T.  The  RMSE  was  estimated  by  averaging  1000  independent  realizations.  MLE 
is  the  “naive”  empirical  plug-in  estimate.  CAE  is  the  coverage  adjusted  estimator.  BUB-|-  is  the 
BUB  estimator  [16]  with  its  m  parameter  set  to  the  maximum  possible  number  of  words  (VI:  6^ 
=  46,656,  Eield  L:  2^  =  32,768).  BUB-  is  the  BUB  estimator  with  m  set,  naively,  to  the  observed 
number  of  words.  The  actual  values  of  H/T  are  VI:  1.66  and  Eield  L:  0.151.  The  BUB-|-  estimator 
has  a  very  large  RMSE  resulting  from  specifying  m  as  the  maximum  number  of  words.  The  CAE 
estimator  performs  relatively  well,  especially  for  sample  sizes  as  small  as  several  hundred  words. 


2  Background 


In  linguistic  applications,  X  could  be  the  set  of  words  in  a  language,  with  P  specifying  their 
frequency  of  occurrence.  Eor  neuronal  data,  Xi  often  represents  the  number  of  spikes  (action 
potentials)  occurring  during  the  ith  time  bin.  Alternatively,  when  a  fine  resolution  of  time  is  used 
(such  as  At  =  1  millisecond),  the  occurrence  of  spikes  is  indicated  by  a  binary  sequence,  and  Xi 
becomes  the  pattern,  or  “word,”  made  up  of  0-1  words  or  “letters,”  for  the  ith  word.  This  is 
described  in  Figure  2,  and  it  is  the  basis  for  the  widely- used  “direct  method”  proposed  by  Strong 
et  al.  [21].  The  number  of  possible  words  m  :=  j{x  G  A  :  P{x)  >  0}]  is  usually  unknown  and 
possibly  infinite.  In  the  example  in  Figure  2,  the  maximum  number  of  words  is  the  total  number 
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Figure  2:  The  top  row  depicts  45  milliseconds  of  a  hypothetical  spike  train.  The  ticks  on  the  time 
axis  demarcate  At  =  1  millisecond  bins  (intervals).  The  spike  train  is  discretized  into  a  sequence 
of  counts.  Each  count  is  the  number  of  spikes  that  fall  within  a  single  time  bin.  Subdividing 
this  sequence  into  words  of  length  T  =  10  leads  to  the  words  shown  at  the  bottom.  The  words 
Xi,X2, . . .  take  values  in  the  space  X  =  {0, consisting  of  all  0-1  strings  of  length  10. 


of  0-1  strings  of  length  T.  For  T  =  10  this  number  is  1024;  for  T  =  20  it  is  well  over  one  million, 
and  in  general  there  is  an  exponential  explosion  with  increasing  T.  Furthermore,  the  phenomenon 
under  investigation  will  often  involve  fine  time  resolution,  necessitating  a  small  bin  size  At  and 
thus  a  large  T.  For  large  T,  the  estimation  of  P(x)  is  likely  to  be  challenging. 

We  note  that  Strong  et  al.  [21]  calculated  the  entropy  rate.  Let  {Wt  :  t  =  1, 2, . . .}  be  a 
discretized  (according  to  At)  spike  train  as  in  the  example  in  Figure  2.  If  {Wt}  is  a  stationary 
process,  the  entropy  of  a  word,  say  Xi  =  (ITi, . . . ,  Wt),  divided  by  its  length  T  is  non-increasing 
in  T  and  has  a  limit  as  T  ^  00,  i.e. 

lim  ^H{Xi)  =  hm  ^H{Wi, . . .  ,Wt)  =:  H'  (2) 

exists  [6].  This  is  the  entropy  rate  of  {Wt}.  The  word  entropy  is  used  to  estimate  the  entropy 
rate.  If  {Wt}  has  finite  range  dependence,  then  the  above  entropy  factors  into  a  sum  of  conditional 
entropies  and  a  single  marginal  entropy.  Generally,  the  word  length  is  chosen  to  be  large  enough  so 
that  H{Wi, . . . ,  Wt)/T  is  a  close  approximation  to  H' ,  but  not  so  large  that  there  are  not  enough 
words  to  estimate  H{Wi, . . . ,  Wt).  Strong  et  al.  [21]  proposed  that  the  entropy  rate  estimate  be 
extrapolated  from  estimates  of  the  word  entropy  over  a  range  of  word  lengths.  We  do  not  address 
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this  extrapolation,  but  rather  focus  on  the  problem  of  estimating  the  entropy  of  a  word. 

In  the  most  basic  case  the  observations  Xi , ,  Xn  are  assumed  to  be  independent  and  identi¬ 
cally  distributed  (i.i.d.).  Without  loss  of  generality,  we  assume  that  df  C  N  and  that  the  words  ^ 
are  labeled  1,2,....  The  seemingly  most  natural  estimate  is  the  empirical  plug-in  estimator 

H  ■.=  -Y,Hx)logP{x),  (3) 

X 

which  replaces  the  unknown  probabilities  in  (1)  with  the  empirical  probabilities  P{x)  :=  nxjn, 
that  is  the  observed  proportion  Ux/n  of  occurrences  of  the  word  x  in  Xi, . . . ,  Xn.  The  empirical 
plug-in  estimator  is  often  called  the  “naive”  estimate  or  the  “MLE” -after  the  fact  that  P  is  the 
maximum  likelihood  estimate  of  P.  We  will  use  “MLE”  and  “empirical  plug-in”  interchangeably. 
From  Jensen’s  Inequality  it  is  readily  seen  that  the  MLE  is  negatively  biased  unless  P  is  trivial.  In 
fact  no  unbiased  estimate  of  entropy  exists,  see  [16]  for  an  easy  proof. 

In  the  finite  m  case,  Basharin  [3]  showed  that  the  MLE  is  biased,  consistent,  and  asymptotically 
normal  with  variance  equal  to  the  entropy  variance  Var[log  P(Xi)].  Miller  [13]  previously  studied 
the  bias  independently  and  provided  the  formula 

EF-F  =  -"^^  +  0(l/n2).  (4) 

The  bias  dominates  the  mean  squared  error  of  the  estimator  [1],  and  has  been  the  focus  of  recent 
studies  [23,  16]. 

The  original  “direct  method”  advocated  an  ad-hoc  strategy  of  bias  reduction  based  on  a  sub¬ 
sampling  extrapolation  [21].  A  more  principled  correction  based  on  the  jackknife  technique  was 
proposed  earlier  by  Zahl  [27].  The  formula  (4)  suggests  a  bias  correction  of  adding  (m  —  l)/(2n) 
to  the  MLE.  This  is  known  as  the  Miller-Maddow  correction.  Unfortunately,  it  is  an  asymptotic 
correction  that  depends  on  the  unknown  parameter  m.  Paninski  [16]  observed  that  both  the  MLE 
and  Miller-Maddow  estimates  fall  into  a  class  of  estimators  that  are  linear  in  the  frequencies  of  ob- 

^The  information  theory  literature  traditionally  refers  to  X  as  an  alphabet  and  its  elements  as  symbols.  It  is 
natural  to  call  a  tuple  of  symbols  a  word,  but  the  problem  of  estimating  the  entropy  of  the  T-tuple  word  reduces  to 
that  of  estimating  the  entropy  in  an  enlarged  space  (of  T-tuples). 
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served  word  counts  fj  =  \{nx  ■  nx  =  j}\-  He  proposed  an  estimate,  “Best  Upper  Bounds”  (BUB), 
based  on  numerically  minimizing  an  upper-bound  on  the  bias  and  variance  of  such  estimates  when 
m  is  assumed  finite  and  known.  We  note  that  in  the  case  that  m  is  unknown,  it  can  be  replaced 
by  an  upper-bound,  but  the  performance  of  the  estimator  is  degraded. 

Bayesian  estimators  have  also  been  proposed  for  the  finite  m  case  by  Wolpert  and  Wolf  [25]. 
Their  approach  is  to  compute  the  posterior  distribution  of  entropy  based  on  a  symmetric  Dirichlet 
prior  on  P.  Nemenman  et  al.  [14]  found  that  the  Dirichlet  prior  on  P  induces  a  highly  concentrated 
prior  on  entropy.  They  argued  that  this  property  is  undesirable  and  proposed  an  estimator  based 
on  a  Dirichlet  mixture  prior  with  the  goal  of  flattening  the  induced  prior  distribution  on  entropy. 
Their  estimate  requires  a  numerical  integration  and  also  the  unknown  parameter  m,  or  at  least 
an  upper-bound.  The  estimation  of  m  is  even  more  difficult  than  the  estimation  of  entropy  [1], 
because  it  corresponds  to  estimating  hmaj^o 

In  the  infinite  m  case,  Antos  and  Kontoyiannis  [1]  proved  consistency  of  the  empirical  plug-in 
estimator  and  showed  that  there  is  no  universal  rate  of  convergence  for  any  estimator.  However, 
Wyner  and  Foster  [26]  have  shown  that  the  best  rate  (to  first  order)  for  the  class  of  distributions 
with  with  finite  entropy  variance  or  equivalently  finite  log-likelihood  second  moment 

^  P(x)(log  P{x)  f  <  oo  (5) 

X 

is  Op(l/logn).  This  rate  is  achieved  by  the  empirical  plug-in  estimate  as  well  as  an  estimator 
based  on  match  lengths.  Despite  the  fact  that  the  empirical  plug-in  estimator  is  asymptotically 
optimal,  its  finite  sample  performance  leaves  much  to  be  desired. 

Chao  and  Shen  [5]  proposed  a  coverage  adjusted  entropy  estimator  intended  for  the  case  when 
there  are  potentially  unseen  words  in  the  sample.  This  is  always  the  case  when  m  is  relatively  large 
or  infinite.  Intuitively,  low  probability  words  are  typically  absent  from  most  sequences,  i.e.  the 
expected  sample  coverage  is  <  1,  but  in  total,  the  missing  words  can  have  a  large  contribution  to 
H.  The  estimator  is  based  on  plug-in  of  a  coverage  adjusted  version  of  the  empirical  probability 
into  the  Horvitz-Thompson  [8]  estimator  of  a  population  total.  They  presented  simulation  results 
showing  that  the  estimator  seemed  to  perform  quite  well,  especially  in  the  small  sample  size  regime. 
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when  compared  to  the  usual  empirical  plug-in  and  several  bias  corrected  variants.  The  estimator 
does  not  require  knowledge  of  m,  but  they  assumed  a  finite  m.  We  prove  here  (Theorem  1)  that 
the  coverage  adjusted  estimator  also  works  in  the  infinite  m  case.  Chao  and  Shen  also  provided 
approximate  confidence  intervals  for  the  coverage  adjusted  estimate,  however  they  are  asymptotic 
and  depend  on  the  assumption  of  finite  m. 

The  problems  of  entropy  estimation  and  estimation  of  the  distribution  P  are  distinct.  Entropy 
estimation  should  be  no  harder  than  estimation  of  P,  since  H  is  a  functional  of  P.  However,  several 
of  the  entropy  estimators  considered  here  depend  either  implicitly  or  explicitly  on  estimating  P. 
BUB  is  linear  in  the  frequency  of  observed  word  counts  fj,  and  those  are  1-to-l  with  the  empirical 
distribution  P  up  to  labeling.  In  general,  any  symmetric  estimator  is  a  function  of  P.  The  only 
estimators  mentioned  above  that  does  not  depend  on  P  is  the  match  length  estimator.  For  the 
coverage  adjusted  estimator,  the  dependence  on  estimating  P  is  only  through  estimating  P{k)  for 
observed  words  k. 

3  Theory 

Unobserved  words — those  that  do  not  appear  in  the  sample,  but  have  non-zero  probability-can 
have  a  great  impact  on  entropy  estimation.  However,  these  effects  can  be  mitigated  with  two  types 
of  corrections:  Horvitz-Thompson  adjustment  and  coverage  adjustment  of  the  probability  estimate. 
Section  3.1  contains  an  exposition  of  some  of  these  effects.  The  adjustments  are  described  in  Section 
3.2  along  with  the  definition  of  the  resulting  coverage  adjusted  entropy  estimator.  A  key  ingredient 
of  the  estimator  is  a  coverage  adjusted  probability  estimate.  We  provide  a  novel  derivation  from 
the  viewpoint  of  regularization  in  Section  3.3.  Lastly,  Section  3.4  concludes  the  theoretical  study 
with  our  rate  of  convergence  results. 

Throughout  this  section  we  assume  that  Xi, . . .  ,Xn  is  an  i.i.d.  sequence  from  the  distribution 
P  on  the  countable  set  X.  Without  loss  of  generality,  we  assume  that  the  P{k)  >  0  for  all  k  £  X 


6 


and  write  pk  for  P{k)  =  P(Xi  =  k).  As  before,  m  :=  \X\  and  possibly  m  =  oo.  Let 

n 

Uk  :='^l{Xi  =  k}  (6) 

i=l 

be  the  number  of  times  that  the  word  k  appears  in  the  sequence  Xi, . . .  ,Xn,  with  !{•}  denoting 
the  indicator  of  the  event  {•}. 

3.1  The  Unobserved  Word  Problem 

The  set  of  observed  words  S  is  the  set  of  words  that  appear  at  least  once  in  the  sequence  Xi, 

i.e. 

S  :=  {k  :  Uk  >  0}.  (7) 

The  complement  of  S,  i.e.  X\S,  is  the  set  of  unobserved  words.  There  is  always  a  non-zero 
probability  of  unobserved  words,  and  if  m  >  n  or  m  =  oo  then  there  are  always  unobserved  words. 
In  this  section  we  describe  two  effects  of  the  unobserved  words  pertaining  to  entropy  estimation. 
Given  the  set  of  observed  words  S,  the  entropy  of  P  can  be  written  as  the  sum  of  two  parts: 

H  = -'^Pklogpk -'^Pk^ogpk.  (8) 

k&S  k^S 

One  part  is  the  contribution  of  observed  words;  the  other  is  the  contribution  of  unobserved  words. 
Suppose  for  a  moment  that  pk  is  known  exactly  for  k  G  S,  but  unknown  for  k  ^  S.  Then  we  could 
try  to  estimate  the  entropy  by 

-'^Pklogpk,  (9) 

kes 

but  there  would  be  an  error  in  the  estimate  unless  the  sample  coverage 

C:='£Pk  (10) 

kes 

is  identically  1.  The  error  is  due  to  the  contribution  of  unobserved  words  and  thus  the  unobserved 
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summands: 


-'^Pklogpk.  (11) 

k^S 

This  error  could  be  far  from  negligible,  and  its  size  depends  on  the  pk  for  k  ^  S.  However,  there  is 
an  adjustment  that  can  be  made  so  that  the  adjusted  version  of  (9)  is  an  unbiased  estimate  of  H. 
This  adjustment  comes  from  the  Horvitz-Thompson  [8]  estimate  of  a  population  total,  and  we  will 
review  it  in  Section  3.2. 

Unfortunately,  pk  is  unknown  for  both  k  G  S  and  k  ^  S.  A  common  estimate  for  pk  is  the 
MLE/empirical  pk  ■=  nk/n.  Plugging  this  estimate  into  (9)  gives  the  MLE/empirical  plug-in 
estimate  of  entropy: 

-^Pfclogpfc  = -^PfclogPfc,  (12) 

k  kes 

because  Pk  =  0  for  all  k  ^  S.  If  the  sample  coverage  C  is  <  1,  then  this  is  a  degenerate  estimate 
because  J2k&sPi^  ~  ^  so  =  0  for  all  k  ^  S.  Thus,  we  could  shrink  the  estimate  of  pk  on 
S  toward  zero  so  that  its  sum  over  S'  is  <  1.  This  is  the  main  idea  behind  the  coverage  adjusted 
probability  estimate,  however  we  will  derive  it  from  the  viewpoint  of  regularization  in  Section  3.3. 

We  have  just  seen  that  unobserved  words  can  have  two  negative  effects  on  entropy  estimation: 
unobserved  summands  and  error-contaminated  summands.  The  “size,”  or  non-coverage,  of  the  set 
of  unobserved  words  can  be  measured  by  1  minus  the  sample  coverage: 


l-C  =  Y,Pk  =  nXn+i^S\S).  (13) 

k^S 

Thus,  it  is  also  the  conditional  probability  that  a  future  observation  is  not  a  previously 

observed  word.  So  the  average  non-coverage  is 


E(1  -C)=  nXn+i  iS)  =  Y,pk{l-pkT.  (14) 

k 

and  in  general  E(1  —  C)  >  0.  Its  rate  of  convergence  to  0,  as  n  ^  oo,  depends  on  P  and  can  be 
very  slow.  (See  the  corollary  to  Theorem  2  below).  It  is  necessary  to  understand  how  to  mitigate 
the  effects  of  unobserved  words  on  entropy  estimation. 


3.2  Coverage  Adjusted  Entropy  Estimator 

Chao  and  Shen  [5]  observed  that  entropy  can  be  thought  of  as  the  total  yk  of  an  unknown  pop¬ 
ulation  consisting  of  elements  yk  =  — pfclogpfc.  For  the  general  problem  of  estimating  a  population 
total,  the  Horvitz-Thompson  estimator  [8], 


provides  an  unbiased  estimate  of  under  the  assumption  that  the  inclusion  probabilities 

F{k  G  S)  and  yk  are  known  for  k  G  S.  For  the  i.i.d.  sequence  Ai, . . . ,  Xn  the  probability  that  word 
k  is  unobserved  in  the  sample  is  {1  —  pk)^-  So  the  inclusion  probability  is  1  —  (1  —  pk)^-  Then  the 
Horvitz-Thompson  adjusted  version  of  (9)  is 


-Pk  logpfc 


(16) 


All  that  remains  is  to  estimate  pk  for  A:  G  S'.  The  empirical  pk  can  be  plugged  into  the  above 
formula,  however,  as  we  stated  in  the  previous  section,  it  is  a  degenerate  estimate  when  C  <  1 
because  it  assigns  0  probability  iok  ^  S  and,  thus,  tends  to  overestimates  the  inclusion  probability. 
We  will  discuss  this  further  in  Section  3.3. 

In  a  related  problem,  Ashbridge  and  Goudie  [2]  considered  finite  populations  with  elements 
yfc  =  1,  so  that  (15)  becomes  an  estimate  of  the  population  size.  They  found  that  P  did  not  work 
well  and  suggested  using  instead  a  coverage  adjusted  estimate  P  :=  CP,  where  C  is  an  estimate  of 
C.  Chao  and  Shen  recognized  this  and  proposed  using  the  Good- Turing  [7,  19]  coverage  estimator: 


C  :=  1  -  — , 
n 


(17) 


where  /i  :=  Yhk  ^{'^k  =  1}  is  the  number  of  singletons  in  the  sequence  Xi,. . . ,  A„.  This  leads  to 
the  coverage  adjusted  entropy  estimator: 


EPk  logPfc 

l-{l-pkY 


(18) 
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where  pk  ■=  Cpk-  Chao  and  Shen  gave  an  argument  for  CP  based  on  a  conditioning  property  of 
the  multinomial  distribution.  In  the  next  section  we  give  a  different  derivation  from  the  perspective 
of  regularization  of  an  empirical  risk,  and  give  upper-bounds  for  the  bias  and  variance  of  C. 

3.3  Regularized  Probability  Estimation 

Consider  the  problem  of  estimating  P  under  the  entropy  loss  L{q,x)  =  —  logQ(a;),  for  Q  satisfying 
Q{k)  =  Qk  P  0  and  '^qk  =  1-  This  loss  function  is  closely  aligned  with  the  problem  of  entropy 
estimation  because  the  risk,  i.e.  the  expected  loss  on  a  future  observation. 


R{Q)  := -ElogQ{Xn+i) 


(19) 


is  uniquely  minimized  hy  Q  =  P  and  its  optimal  value  is  the  entropy  of  P.  The  MLE  P  minimizes 
the  empirical  version  of  the  risk 

1  ” 

R{Q)  := —T\ogQ{Xi).  (20) 

n 

i=\ 

As  stated  previously  in  Section  3.1,  this  is  a  degenerate  estimate  when  there  are  unobserved  words. 
More  precisely,  if  the  expected  coverage  EC  <  1  (which  is  true  in  general),  then  i?(P)  =  oo. 

Analogously  to  (8),  the  expectation  in  (19)  can  be  split  into  two  parts  by  conditioning  on 
whether  Xn+i  is  a  previously  observed  word  or  not: 


R{Q)  =  -  E[logg(A„+i)|A„+i  G  5]  P(A„+i  G  5) 
-  E[logg(A„+i)|A„+i  i  S\  P(A„+1  i  S). 


(21) 


Since  P(A„+i  G  S)  does  not  depend  on  g,  minimizing  (21)  with  respect  to  Q  is  equivalent  to 
minimizing 

-E[logg(A„+i)|A„+i  G  5]  -  A*E[logg(A„+i)|A„+i  i  5],  (22) 

where  A*  =  P(A„+i  ^  5)/P(A„_|_i  G  S).  We  cannot  distinguish  the  probabilities  of  the  unobserved 
words  on  the  basis  of  the  sample.  So  consider  estimates  Q  which  place  constant  probability  on 
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X  ^  S.  Equivalently,  these  estimates  treat  the  unobserved  words  as  a  single  class  and  so  the  risk 
reduces  to  the  equivalent  form: 


-E[logg(X„+i)|X„+i  G  5] 


A*Elog 


l-Y,Qik) 

fees' 


(23) 


The  above  expectations  only  involve  evaluating  Q  at  observed  words.  Thus,  (20)  is  more  natural  as 
an  estimate  of  — E[log  g(X„+i)|X„+i  G  5],  than  as  an  estimate  of  R{Q).  If  we  let  A  be  any  estimate 
of  the  odds  ratio  A*  =  E(X„+i  ^  5)/P(X„+i  G  S),  then  we  arrive  at  the  regularized  empirieal  risk, 


R{q;  A) 


-  y'logg(Xi)  -  A  log 
n 

i 


l-Y,Qix^) 

i 


(24) 


This  is  the  usual  empirical  risk  with  an  additional  penalty  on  the  total  mass  assigned  to  observed 
words.  It  can  be  verified  that  the  minimizer,  up  to  an  equivalence,  is  (1  +  X)~^P.  This  estimate 
shrinks  the  MLE  towards  0  by  the  amount  (1  +  A)“^.  Any  Q  which  agrees  with  (1  +  A)~^P  on  S  is  a 
minimizer  of  (24).  Note  that  (1  +  A*)“^  =  E(A„+i  G  S)  =  EC  is  the  expected  coverage,  rather  than 
the  sample  coverage  C.  C  can  be  used  to  estimate  both  EC  and  C,  however  it  is  actually  better  as 
an  estimate  of  EC  because  McAllester  and  Schapire  [12]  have  shown  that  C  =  C  +  Op{logn/^/n), 
whereas  we  prove  in  the  appendix  the  following  proposition. 


Proposition  1.  0  >  E(C—C)  =  — ^  >  (1  — l/n)""  ^jn^—e  ^  jn  andYsx  C  <  Ain. 

So  C  is  a  l/-v/n  consistent  estimate  of  EC.  Using  C  to  estimate  EC  =  (1  +  A*)“^,  we  obtain 
the  coverage  adjusted  probability  estimate  P  =  CP. 


3.4  Convergence  Rates 

In  the  infinite  m  case,  Antos  and  Kontoyiannis  [1]  proved  that  the  MLE  is  universally  consistent 
almost  surely  and  in  L? ,  provided  that  the  entropy  exists.  However,  they  also  showed  that  there 
can  be  no  universal  rate  of  convergence  for  entropy  estimation.  Some  additional  restriction  must 
be  made  beyond  the  existence  of  entropy  in  order  to  obtain  a  rate  of  convergence.  Wyner  and 
Eoster  [26]  found  that  for  the  weakest  natural  restriction,  YlkPki).ogpkf‘  <  oo,  the  best  rate  of 
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convergence,  to  first  order,  is  Op(l/logn).  They  proved  that  the  MLE  and  an  estimator  based 
on  match  lengths  achieves  this  rate.  Our  main  theoretical  result  is  that  the  coverage  adjusted 
estimator  also  achieves  this  rate. 

Theorem  1.  Suppose  that  J2kPki^^SPk)‘^  <  oo.  Then  as  n  ^  oo, 

H  =  H  +  Op{l/logn).  (25) 

In  the  previous  section  we  employed  C  =  1  —  fijn,  in  the  regularized  empirical  risk  (24).  As 
for  the  observed  sample  coverage,  C  =  P(A„+i  G  ISIS'),  McAllester  and  Schapire  [12]  proved  that 
C  =  P(An+i  G  S\S)  +  Op{logn/^/n),  regardless  of  the  underlying  distribution.  Our  theorem  below 
together  with  McAllester  and  Schapire’s  implies  a  rate  of  convergence  on  the  total  probability  of 
unobserved  words. 

Theorem  2.  Suppose  that  J2kPk  \  logpfc]'^  <  oo.  Then  as  n  ^  oo,  almost  surely, 

C'  =  l-0(l/(logn)«).  (26) 

Corollary  1.  Suppose  that  '^j^Pkl  logp^l^  <  oo.  Then  as  n  ^  oo, 

1-C  =  F{Xn+i  ^S\S)=  Op{l/{logny).  (27) 

Proof.  This  follows  from  the  above  theorem  and  Theorem  3  of  [12]  which  implies  |C  —  P(Ajj_|_i  G 
ISIS') I  <  op(l/(logn)'?)  because 

0  <  P(A„+i  ^  S\S)  <  |1  -  C|  +  |C  -  F{Xn+i  G  S\S)\  (28) 

and  Op(l/(logn)'^)  +  op(l/(logn)^)  =  Op(l/(logn)'^).  ■  ■ 

We  defer  the  proofs  of  Theorems  1  and  2  to  Appendix  A.  At  the  time  of  writing,  the  only  other 
entropy  estimators  proved  to  be  consistent  and  asymptotically  first-order  optimal  in  the  finite 
entropy  variance  case  that  we  are  aware  of  are  the  MLE  and  Wyner  and  Eoster’s  modified  match 
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length  estimator.  However,  the  Op(l/logn)  rate,  despite  being  optimal,  is  somewhat  discouraging. 
It  says  that  in  the  worst  case  we  will  need  an  exponential  number  of  samples  to  estimate  the  entropy. 
Furthermore,  the  asymptotics  are  unable  to  distinguish  the  coverage  adjusted  estimator  from  the 
MLE,  which  has  been  observed  to  be  severely  biased.  In  the  next  section  we  use  simulations  to 
study  the  small-sample  performance  of  the  coverage  adjusted  estimator  and  the  MLE,  along  with 
other  estimators.  The  results  suggest  that  in  this  regime  their  performances  are  quite  different. 

4  Simulation  Study 

We  conducted  a  large  number  of  simulations  under  varying  conditions  to  investigate  the  performance 
of  the  coverage  adjusted  estimator  (CAE)  and  compare  with  four  other  estimators: 

•  Empirical  Plug-in  (MLE):  defined  in  (3). 

•  Miller-Maddow  corrected  MLE  (MM):  based  on  the  asymptotic  bias  formula  provided  by 

Miller  [13]  and  Basharin  [3].  It  is  derived  from  equation  (4)  by  estimating  m  by  the  number 
of  distinct  words  observed  rh  =  Yhk  adding  (m  —  l)/(2n)  to  the  MLE. 

•  Jackknife  (JK):  proposed  by  Zahl  [27].  It  is  a  bias-corrected  version  of  the  MLE  obtained  by 
averaging  all  n  leave-one-out  estimates. 

•  Best  Upper  Bounds  (BUB):  proposed  by  Paninski  [16].  It  is  obtained  by  numerically  mini¬ 
mizing  a  worst  case  error  bound  for  a  certain  class  of  linear  estimators  for  a  distribution  with 
known  support  size  m. 

The  NSB  estimator  proposed  by  [14]  was  not  included  in  our  simulation  comparison  because  of 
problems  with  the  software  and  its  computational  cost.  We  also  tried  their  asymptotic  formula  for 
their  estimator  in  the  “infinite  (or  unknown)”  m  case: 

V'(l)/ln(2)  —  1 -|- 21ogn  —  V'(n  —  m),  (29) 

where  ^^{z)  =  T'{z)/T{z)  is  the  digamma  function.  However,  we  were  also  unable  to  get  it  to  work 
because  it  seemed  to  increase  unboundedly  with  the  sample  size,  even  for  m  =  oo  cases. 
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There  are  two  sets  of  experiments  consisting  of  multiple  trials.  The  first  set  of  experiments 
concern  some  simple,  but  popular  model  distributions.  The  second  set  of  experiments  deal  with 
neuronal  data  recorded  from  primate  visual  and  avian  auditory  systems.  It  departs  from  the 
theoretical  assumptions  of  Section  3  in  that  the  observations  are  dependent. 

Chao  and  Shen  [5]  also  conducted  a  simulation  study  of  the  coverage  adjusted  estimator  for 
distributions  with  small  m  and  showed  that  it  performs  reasonably  well  even  when  there  is  a 
relatively  large  fraction  of  unobserved  words.  Their  article  also  contains  examples  from  real  data 
sets  concerning  diversity  of  species.  The  experiments  presented  here  are  intended  to  complement 
their  results  and  expand  the  scope. 

Practical  Considerations 

We  encountered  a  few  practical  hurdles  when  performing  these  experiments.  The  first  is  that  the 
coverage  adjusted  estimator  is  undefined  when  the  sample  consists  entirely  of  singletons.  In  this 
case  (7  =  0  and  p  =  0.  The  probability  of  this  event  decays  exponentially  fast  with  the  sample 
size,  so  it  is  only  an  issue  for  relatively  small  samples.  To  deal  with  this  matter  we  replaced  the 
denominator  n  in  the  definition  of  C  with  n  +  1.  This  minor  modification  does  not  affect  the 
asymptotic  behavior  of  the  estimator,  and  allows  it  to  be  defined  for  all  cases. ^ 

The  BUB  estimator  assumes  that  the  number  of  words  m  is  finite  and  requires  that  it  be 
specified,  m  is  usually  unknown,  but  sometimes  an  upper-bound  on  m  may  be  assumed.  To 
understand  the  effect  of  this  choice  we  tried  three  different  variants  on  the  BUB  estimator’s  m 
parameter: 

•  Understimate  (BUB-):  The  naive  m  as  defined  above  for  the  Miller-Maddow  corrected  MLE. 

•  Oracle  value  (BUB.o):  The  true  m  in  the  finite  case  and  [2^]  in  the  infinite  case. 

•  Overestimate  (BUB-I-):  Twice  the  oracle  value  for  the  first  set  of  experiments  and  the  maxi¬ 
mum  number  of  words  1^71  for  the  second  set  of  neuronal  data  experiments. 

^Another  variation  is  to  add  .5  to  the  numerator  and  1  to  the  denominator. 
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support  {k  =) 

Pk 

H 

Var[logp(A)] 

Uniform 

1,...,1024 

1/1024 

10 

0 

Zipf 

1,...,1024 

k-^/Ekk-^ 

7.51 

9.59 

Poisson 

1,...,(X) 

1024'’’/(fc!ei°24) 

7.05 

1.04 

Geometric 

l,...,oo 

(1023/1024)'=-Vl024 

11.4 

2.08 

Table  1:  Standard  models  considered  in  the  first  set  of  experiments. 

Although  the  BUB  estimator  is  undefined  for  the  m  infinite  case,  we  still  tried  using  it,  defining  the 
m  parameter  of  the  oracle  estimator  to  be  [ 2^] .  This  is  motivated  by  the  Asymptotic  Equipartition 
Property  (AEP)  [6],  which  roughly  says  that,  asymptotically,  2^  is  the  effective  support  size  of  the 
distribution.  There  are  no  theoretical  guarantees  for  this  heuristic  use  of  the  BUB  estimator,  but 
it  did  seem  to  work  in  the  simulation  cases  below.  Again,  this  is  an  oracle  value  and  not  actually 
known  in  practice.  The  implementation  of  the  estimator  was  adapted  from  software  provided  by 
the  author  of  [16]  and  its  numerical  tuning  parameters  were  left  as  default. 

Experimental  Setup 

In  each  trial  we  sample  from  a  single  distribution  and  compute  each  estimator’s  estimate  of  the 
entropy.  Trials  are  repeated,  with  1,000  independent  realizations. 

Standard  Models  We  consider  the  four  discrete  distributions  shown  in  Table  1.  The  uniform 
and  truncated  Zipf  distributions  have  finite  support  {m  =  1 , 024) ,  while  the  Poisson  and  geometric 
have  infinite  support.  The  Zipf  distribution  is  very  popular  and  often  used  to  model  linguistic  data. 
It  is  sometimes  referred  to  as  a  “power  law.”  We  generated  i.i.d.  samples  of  varying  sample  size  (n) 
from  each  distribution  and  computed  the  respective  estimates.  We  also  considered  the  distribution 
of  distinct  words  in  James  Joyce’s  novel  Ulysses.  We  found  that  results  were  very  similar  to  that 
of  the  Zipf  distribution  and  did  not  include  them  in  this  article. 

Neuronal  Data  Here  we  consider  two  real  neuronal  data  sets  first  presented  in  [22] .  A  subset  of 
the  data  are  available  from  the  Neural  Prediction  Challenge^.  We  fit  a  variable  length  Markov  chain 
(VLMC)  to  subsets  of  each  data  set  and  treated  the  fitted  models  as  the  truth.  Our  goal  was  not 

"^http :  / /neuralpredict  ion .  berkeley .  edu 
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depth  (msec) 

A 

word  length  T 

|A| 

H 

H/T 

Field  L  VLMC  232  (232) 

10 

1,024 

1.51 

0.151 

232  (232) 

{0,1}'^ 

15 

32,768 

2.26 

0.150 

VI  VLMC  3  (48) 

{0,1,...,5}5 

5 

7,776 

8.32 

1.66 

3  (48) 

{0,1,...,5}6 

6 

46,656 

9.95 

1.66 

Table  2:  Fitted  VLMC  models.  Entropy  {H)  was  computed  by  Monte  Carlo  with  10®  samples  from 
the  stationary  distribution.  H/T  is  the  entropy  of  the  word  divided  by  its  length. 


to  model  the  neuronal  data  exactly,  but  to  construct  an  example  which  reflects  real  neuronal  data, 
including  any  inherent  dependence.  This  experiment  departs  from  the  assumption  of  independence 
for  the  theoretical  results.  See  [10]  for  a  general  overview  of  the  VLMC  methodology. 

From  the  first  data  set,  we  extracted  10  repeated  trials,  recorded  from  a  single  neuron  in  the  Field 
L  area  of  avian  auditory  system  during  natural  song  stimulation.  The  recordings  were  discretized 
into  At  =  1  millisecond  bins  and  consist  of  sequences  of  O’s  and  I’s  indicating  the  absence  or 
presence  of  a  spike.  We  concatenated  the  ten  recordings  before  fitting  the  VLMC  (with  state  space 
{0, 1}).  A  complete  description  of  the  physiology  and  other  information  theoretic  calculations  from 
the  data  can  be  found  in  [9] . 

The  other  data  set  contained  several  separate  single  neuron  recording  sequences  from  the  VI 
area  of  primate  visual  system,  during  a  dynamic  natural  image  stimulation.  We  used  the  longest 
contiguous  sequence  from  one  particular  trial.  This  consisted  of  3,449  spike  counts,  ranging  from  0 
to  5.  The  counts  are  number  of  spikes  occurring  during  consecutive  At  =  16  millisecond  periods. 
(For  the  VI  data,  the  state  space  of  the  VLMC  is  {0, 1,2, 3, 4,  5}).  The  resulting  fits  for  both  data 
sets  are  shown  in  Table  2.  Note  that  for  each  VLMC,  H/T  is  nearly  the  same  for  both  choices  of 
word  length  (cf.  the  remarks  under  equation  (2)  in  Section  2). 

The  (maximum)  depth  of  the  VLMC  is  a  measure  of  time  dependence  in  the  data.  For  the  Field 
L  data,  the  dependence  is  long,  with  the  VLMC  looking  232  time  periods  (232  msec)  into  the  past. 
This  may  reflect  the  nature  of  the  stimulus  in  the  Field  L  case.  For  the  VI  data,  the  dependence 
is  short  with  the  fitted  VLMC  looking  only  3  time  periods  (48  msec)  into  the  past. 

Samples  of  length  n  were  generated  from  the  stationary  distribution  of  the  fitted  VLMCs.  We 
subdivided  each  sample  into  non-overlapping  words  of  length  T.  Figure  2  shows  this  for  the  Field 
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L  model  with  T  =  10.  We  tried  two  different  word  lengths  for  each  model.  The  word  lengths  and 
entropies  are  shown  in  Table  2.  We  then  computed  each  estimator’s  estimate  of  entropy  on  the 
words  and  divided  by  the  word  length  to  get  an  estimate  of  the  entropy  rate  of  the  word. 

We  treat  m  as  unknown  in  this  example  and  did  not  include  the  oracle  BUB.o  in  the  experiment. 
We  used  the  maximum  possible  value  of  m,  i.e.  \At\  for  BUB+.  In  the  case  of  Field  L  with  T  =  10, 
this  is  1,024.  The  other  values  are  shown  in  Table  2. 

Results 

Standard  Models  The  results  are  plotted  in  Figures  3,  4.  It  is  surprising  that  good  estimates 
can  be  obtained  with  just  a  few  observations.  Estimating  m  from  its  empirical  value  marginally 
improves  MM  over  the  MLE.  The  naive  BUB-,  which  also  uses  the  empirical  value  of  m,  performs 
about  the  same  as  JK. 

Bias  apparently  dominates  the  error  of  most  estimators.  The  CAE  estimator  trades  away  bias 
for  a  moderate  amount  of  variance.  The  RMSE  results  for  the  four  distributions  are  very  similar. 
The  CAE  estimator  performs  consistently  well,  even  for  smaller  sample  sizes,  and  is  competitive 
with  the  oracle  BUB.o  estimator.  The  Zipf  distribution  example  seems  to  be  the  toughest  case  for 
the  CAE  estimator,  but  it  still  performs  relatively  well  for  sample  sizes  of  at  least  1,000. 

Neuronal  Data  The  results  are  presented  in  Eigures  5  and  6.  The  effect  of  the  dependence  in 
the  sample  sequences  is  not  clear,  but  all  the  estimators  seem  to  be  converging  to  the  truth.  CAE 
consistently  performs  well  for  both  VI  and  Eield  L,  and  really  shines  in  the  VI  example.  However, 
for  Eield  L  there  is  not  much  difference  between  the  estimators,  except  for  BUB+. 

BUB+  uses  m  equal  to  the  maximum  number  of  words  \X\  and  performs  terribly  because  the 
data  are  so  sparse.  The  maximum  entropy  corresponding  to  \Af\  is  much  larger  than  the  actual 
entropy.  In  the  Field  L  case,  the  maximum  entropies  are  10  and  15,  while  the  actual  entropies  are 
1.51  and  2.26.  In  the  VI  case,  the  maximum  entropies  are  12.9  and  15.5,  while  the  actual  entropies 
are  8.32  and  9.95.  This  may  be  the  reason  that  the  BUB+  estimator  has  such  a  large  positive 
bias  in  both  cases,  because  the  estimator  is  designed  to  approximately  minimize  a  balance  between 
upper-bounds  on  worst  case  bias  and  variance. 
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Uniform  Distribution 


Zipf  Distribution 


Figure  3:  The  two  distributions  considered  here  have  finite  support,  with  m  =  1,024.  (Left)  The 
estimated  entropy  for  several  different  estimators,  over  a  range  of  sample  sizes  n.  The  lines  are 
average  estimates  taken  over  1,000  independent  realizations,  and  the  vertical  bars  indicate  ±  one 
standard  deviation  of  the  estimate.  The  actual  value  of  H  is  indicated  by  a  solid  gray  horizontal 
line.  MM  and  JK  are  the  Miller-Maddow  and  Jackknife  corrected  MLEs.  BUB-,  BUB.o,  and  BUB+ 
are  the  BUB  estimator  with  its  m  parameter  set  to  a  naive  m,  oracle  m  =  1024,  and  twice  the 
oracle  m.  CAE  is  the  coverage  adjusted  estimator.  (Right)  The  corresponding  root  mean  squared 
error  (RMSE).  Bias  dominates  most  estimates.  Eor  the  uniform  distribution,  CAE  and  BUB.o  have 
relatively  small  biases  and  perform  very  well  for  sample  sizes  as  small  as  several  hundred.  Eor  the 
Zipf  case,  the  CAE  estimator  performs  nearly  as  well  as  the  oracle  BUB.o  for  sample  sizes  larger 
than  500. 
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Poisson  Distribution 


Geometric  Distribution 


Figure  4:  The  two  distributions  considered  here  have  infinite  support,  with  m  =  oo.  (Left)  The 
estimated  entropy  for  several  different  estimators,  over  a  range  of  sample  sizes  n.  The  lines  are 
average  estimates  taken  over  1,000  independent  realizations,  and  the  vertical  bars  indicate  ±  one 
standard  deviation  of  the  estimate.  The  actual  value  of  H  is  indicated  by  a  solid  gray  horizontal 
line.  MM  and  JK  are  the  Miller-Maddow  and  Jackknife  corrected  MLEs.  BUB-,  BUB.o,  and 
BUB+  are  the  BUB  estimator  with  its  m  parameter  set  to  a  naive  m,  oracle  m  =  [2^],  and  twice 
the  oracle  m.  CAE  is  the  coverage  adjusted  estimator.  (Right)  The  corresponding  root  mean 
squared  error  (RMSE).  Results  are  very  similar  to  those  in  the  previous  figure,  the  CAE  estimator 
performs  nearly  as  well  as  the  oracle  BUB.o. 
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Field  LVLMC  (T=10) 


Field  LVLMC  (T=15) 


Figure  5:  (Left)  The  estimated  entropy  rate  for  several  different  estimators.  Samples  of  size  n  are 
drawn  from  a  stationary  VLMC  used  to  model  neuronal  data  from  Field  L  of  avian  auditory  system. 
A  single  sample  corresponds  to  1  millisecond  of  recording  time.  We  followed  the  “direct  method” 
and  divided  each  sample  sequence  into  words  of  length  T.  In  the  top  row  the  word  length  is  T  =  10 
and  the  maximum  number  of  words  |  Aj  is  1,024.  In  the  bottom  row  T  =  15  and  |  Aj  =  32,  768.  The 
lines  are  average  estimates  taken  over  1,000  independent  realizations,  and  the  vertical  bars  indicate 
±  one  standard  deviation  of  the  estimate.  The  actual  value  of  H/T  is  indicated  by  a  solid  gray 
horizontal  line.  MM  and  JK  are  the  Miller-Maddow  and  Jackknife  corrected  MLEs.  BUB-  and 
BUB-|-  are  the  BUB  estimator  with  its  m  parameter  set  to  a  naive  m  and  the  maximum  possible 
number  of  words  \X\:  1,024  for  the  top  row  and  32,768  for  the  bottom.  CAE  is  the  coverage  adjusted 
estimator.  (Right)  The  corresponding  root  mean  squared  error  (RMSE).  The  BUB-|-  estimator 
has  a  considerably  large  bias  in  both  cases.  The  CAE  estimator  has  a  moderate  balance  of  bias 
and  variance  and  shows  a  visible  improvement  over  all  other  estimators  in  the  larger  (T  =  15)  word 
case. 
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VI  VLMC  (T=5) 


V1  VLMC  (T=6) 


Figure  6:  (Left)  The  estimated  entropy  rate  for  several  different  estimators.  The  samples  of  size  n 
are  drawn  from  a  stationary  VLMC  used  to  model  neuronal  data  from  VI  of  primate  visual  system. 
A  single  sample  corresponds  to  16  milliseconds  of  recording  time.  We  followed  the  “direct  method” 
and  divided  each  sample  sequence  into  words  of  length  T.  In  the  top  row  the  word  length  is  T  =  5 
and  the  maximum  number  of  words  \X\  is  7,776.  In  the  bottom  row  T  =  6  and  \X\  =  46,656.  The 
lines  are  average  estimates  taken  over  1,000  independent  realizations,  and  the  vertical  bars  indicate 
±  one  standard  deviation  of  the  estimate.  The  actual  value  of  H/T  is  indicated  by  a  solid  gray 
horizontal  line.  MM  and  JK  are  the  Miller-Maddow  and  Jackknife  corrected  MLEs.  BUB-  and 
BUB-|-  are  the  BUB  estimator  with  its  m  parameter  set  to  a  naive  m  and  the  maximum  possible 
number  of  words:  7,776  for  the  top  row  and  46,656  for  the  bottom.  CAE  is  the  coverage  adjusted 
estimator.  (Right)  The  corresponding  root  mean  squared  error  (RMSE).  The  CAE  estimator  has 
the  smallest  bias  and  performs  much  better  than  the  other  estimators  across  all  sample  sizes. 
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Summary 


The  coverage  adjusted  estimator  is  a  good  choice  for  situations  where  m  is  unknown  and/or  infinite. 
In  these  situations,  the  use  of  an  estimator  which  requires  specification  of  m  is  disadvantageous 
because  a  poor  estimate  (or  upper-bound)  of  m,  or  the  “effective”  m  in  the  infinite  case,  leads 
to  further  error  in  the  estimate.  BUB.o,  which  used  the  oracle  m,  performed  well  in  most  cases. 
However,  it  is  typically  not  available  in  practice,  because  m  is  usually  unknown. 

The  Miller-Maddow  corrected  MLE,  which  used  the  empirical  value  of  m,  improved  on  the 
MLE  only  marginally.  BUB-,  which  is  BUB  with  the  empirical  value  of  m,  performed  better  than 
the  MLE.  It  appeared  to  work  in  some  cases,  but  not  others.  Eor  BUB-I-,  where  we  overestimated 
or  upper-bounded  m  (by  doubling  the  oracle  m,  or  using  the  maximal  IT’D,  the  bias  and  RMSE 
increased  significantly  over  BUB.o  for  small  sample  sizes.  It  appeared  to  work  in  some  cases,  but  not 
others-always  alternating  with  BUB-.  In  the  case  of  the  neuronal  data  models,  BUB-|-  performed 
very  poorly.  In  situations  like  this,  even  though  an  upper-bound  on  m  is  known,  it  can  be  much 
larger  than  the  “effective”  m,  and  result  in  a  gross  error. 

5  Conclusions 

Our  study  has  emphasized  the  value  of  viewing  entropy  estimation  as  a  problem  of  sampling  from 
a  population,  here  a  population  of  words  made  up  of  spike  train  sequence  patterns.  The  coverage 
adjusted  estimator  performed  very  well  in  our  simulation  study,  and  it  is  very  easy  to  compute. 
When  the  word  length  m  is  known,  the  BUB  estimator  can  perform  better.  In  practice,  however, 
m  is  usually  unknown  and,  as  seen  in  VI  and  Eield  L  examples,  assuming  an  upper  bound  on  it 
can  result  in  a  large  error.  The  coverage-adjusted  estimator  therefore  appears  to  us  to  be  a  safer 
choice. 

Other  estimates  of  the  probabilities  of  observed  words,  such  as  the  profile-based  estimator 
proposed  by  Orlitsky  et  al.  [15],  might  be  used  in  place  of  P  in  the  coverage  adjusted  entropy 
estimator  but  that  is  beyond  the  scope  of  this  article. 

The  VI  and  Eield  L  examples  have  substantial  dependence  structure,  yet  methods  derived 
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under  the  i.i.d.  assumption  continue  to  perform  well.  It  may  be  shown  that  both  the  direct  method 
and  the  coverage-adjusted  estimator  remain  consistent  under  the  relatively  weak  assumption  of 
stationarity  and  ergodicity,  but  the  rate  of  convergence  will  depend  on  mixing  conditions.  On 
the  other  hand,  in  the  non-stationary  case  these  methods  become  inconsistent.  Stationarity  is, 
therefore,  a  very  important  assumption.  We  intend  to  discuss  these  issue  at  greater  length  in  a 
separate  paper. 

As  is  clear  from  our  simulation  study,  the  dominant  source  of  error  in  estimating  entropy  is 
often  bias,  rather  than  variance,  which  is  typically  not  captured  from  computed  standard  errors. 
An  important  problem  for  future  investigation  would  therefore  involve  data-driven  estimation  of 
bias  in  the  case  of  unknown  or  infinite  m. 
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A  Proofs 


We  first  prove  Theorem  2.  The  proof  builds  on  the  following  application  of  a  standard  concentration 
technique. 

Lemma  1.  (7  ^  1  almost  surely. 

Proof.  Consider  the  number  of  singletons  /i  as  a  function  of  x”  =  (xi, . . . ,  Xn).  Modifying  a  single 
coordinate  of  xf  changes  the  number  of  singletons  by  at  most  2  because  the  number  of  words 
affected  by  such  a  change  is  at  most  2.  Hence  C  =  1  —  fifn  changes  by  at  most  2/n.  Using 
McDiarmid’s  method  of  bounded  differences,  i.e.  the  Hoeffding-Azuma  Inequality,  gives 

P(|C-EC|  >  e)  <  (30) 

and  by  consequence  of  the  Borel-Cantelli  Lemma,  \C  —  ECj  ^  0  almost  surely.  To  show  that 
EC  ^  1,  we  note  that  1  >  (1  —  Pk)^~^  0  for  all  >  0  and 

11 -ECl  =E- Vlinfc  =  1}  (31) 

n 

k 

=  ^Pkil-Pkr-^^0  (32) 

k 

as  n  — >  oo  by  the  Bounded  Convergence  Theorem.  ■ 

Proof  of  Proposition  1.  The  bias  is 


EC  -  E(X„+i  G  5)  =  E(X„+i  ^  5)  -  E(1  -  C)  (33) 

=  (34) 

k  k 

= -^Pki'^  -  PkT~^  ■  (35) 

k 

This  quantity  is  trivially  non-positive,  and  a  little  bit  of  calculus  shows  that  the  bias  is  maximized 
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by  the  uniform  distribution  pj^  =  1/n: 


YPki.'^  ~ 

^  >  Pk  max  x(l  —  x) 

(36) 

k 

k  - 

=  max  x(l  —  x)*^”^ 

0<x<l 

(37) 

=  (1  -  l/n)'^-Vn 

(38) 

The  variance  bound  can  be  deduced  from  equation  (30),  because  Var  (7  =  PdC  —  ECp  >  x)dx 
and  (30)  implies 


PdC-ECp  >  x)dx  <  /  2e-2”^dx  =  4/n. 


(39) 


Proof  of  Theorem  2.  From  (30)  we  conclude  that  C  =  EC  +  Op{n  ^/^).  So  it  suffices  to  show  that 
EC  =  1  +  0(l/(logn)'?).  Let  Cn  =  lf\/n.  We  split  the  summation  in  (32): 

|1-EC|=  p,(l-p,)-i+  ^  (40) 

k:pk<^n  k:pk>e„ 

Using  Lemma  2  below,  the  first  term  on  the  right  side  is 

Pki'^-PkT~^<  Y  Pk  =  O {1/ (log ny)  (41) 

k-.pk<en  k-.pk<fin 

The  second  term  is 


Pkil-PkT- 

■'<(l-en)"-'  Pk 

(42) 

k:pk>en 

k.pk>en 

<  (1  -  en)"-' 

(43) 

<  exp(— (n  —  l)/\/n) 

(44) 

by  the  well-known  inequality  1  -|-  x  <  e^. 

■ 
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Lemma  2  (Wyner  and  Foster  [26]). 


^  ^  “  log(l/e)9 


Proof.  Since  log(l/x)  is  a  decreasing  function, 


Vk 

k:pk<e 


log 


Pk 


>  Y.  Pk 

k:p^<e 


log 


and  then  we  collect  the  log(l/e)'^  term  to  the  left  side  to  derive  the  claim. 


(45) 


Proof  of  Theorem  1.  Using  the  result  of  Wyner  and  Foster  that  under  the  above  assumptions, 
H  =  H  +  Op(l  / log n) ,  it  suffices  to  show  \P[  —  H\  =  Op(l  /  log  n) .  All  summations  below  will  only 
be  over  k  such  that  >  0  or  >  0.  It  is  easily  verified  that 


h-h=-y: 

k 

=  -E 


Pk  logPfc 
i-{i-PkY ' 

c 


E 


k 


Cpk  log  C 

1-{1-Pk)'^ 

'V' 

Rn 


Pk^OgPk 

-  1  I  PklogPk 


(46) 

(47) 


(48) 


To  bound  Rn  we  will  use  the  Op{l/ {log  rif‘)  rate  of  C  from  Theorem  2.  Note  that  Cjn  <  Cpk  = 
Pfc  <  1  and  by  the  decreasing  nature  of  1/[1  —  (1  —  Pfc)”] 


I  R  I  ^  I  ^1  -  _  I  log  C\ 

“  1  -  (1  -  C/n)^  1  -  (1  -  C/n)^ 

By  Lemma  1,  (7  ^  1  almost  surely  and  since  Xn  ^  I  implies  (1  —  Xnl'n)'^  e 
~  I  log (71/(1  -  e~^)  =  Op(l/(logn)2).  As  for  Dn, 


(49) 


the  right  side  is 


\Dr. 


\c-i\  +  {i-pkr 

i-{i-PkY 


PklogPk 


(50) 
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and  since  pk  >  C /n  whenever  >  0, 


y] - log  Pk  <  — — — ^ - H 


1  —  e~^ 


H 


=  Op(l/(logn) 


(51) 

(52) 

(53) 


because  H  is  consistent.  The  remaining  part  of  will  require  a  bit  more  work  and  we  will  split 
it  according  to  the  size  of  pk-  Let  =  logn/n.  Then 


E 


1  -  (1  -  Pk)' 


-Pfciogpfc  =  - 

k:pk<er 

-  E 

k:pk>e. 


{i-PkT 

1  -  (1  -Pk) 

{i-PkT 

1  -  (1  -pk) 


-Pk  logPfc 


-Pk  logPfc 


(54) 


/-| _ \n 

Similarly  to  our  previous  argument,  is  decreasing  in  p^.  So  the  second  summation  on  the 

right  side  is 


-  E 

^•Pk 


{i-Pkr 

i-{i-Pk)^ 


PklogPk  < 


(1  -  enr 

Op{l/n) 


(55) 

(56) 


For  the  remaining  summation,  we  use  the  fact  that  pk  >  C jn  and  the  monotonicity  argument  once 
more. 


-  E 

k:pk<tn 


{i-PkT 

i-{i-PkY 


Pk'^OgPk  < 


{l-C/nY 

1  -  (1  -  C/nY 


^  Pklogpk 

k:pk<(-n 


(57) 


By  the  consistency  of  C,  the  leading  term  converges  to  the  constant  e  ^/(l  —  e  ^)  and  can  be 
ignored.  Since  —  logpfc  <  logn. 


-  X]  PfclogPfc<logn  ^  Pk  (58) 

k:pk<^n  k:pk<en 
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We  split  the  summation  once  last  time,  but  according  to  the  size  of  pk- 


log  n  E  Pk  <  log  n 

k-.pk<<^n 


Y.  Pk 

k-.pk>l/\/n  k-.pk<\ly/n 


(logn)2  -- 

< - = - hlogn  2^  Pfc, 

k-.pk<^IVn 


n 


(59) 

(60) 


where  we  have  used  the  fact  that  \{k  :  pk  >  1/ ^/n}\  <  ^/n.  Taking  expectation,  applying  Lemma  2 
and  Markov’s  Inequality  shows  that 


=  log  n  Y  Pk  =  Op{l/ log  n) 

k-.pk<^ly/n 

The  proof  is  complete  because  (logn)^/y^  is  also  0(1/ log  n). 


(61) 
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